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Abstract 

The CUR matrix decomposition is an important extension of Nystrom approximation 
to a general matrix. It approximates any data matrix in terms of a small number of its 
columns and rows. In this paper we propose a novel randomized CUR algorithm with 
^sO ■ an expected relative-error bound. The proposed algorithm has the advantages over the 

existing relative-error CUR algorithm that it possesses tighter theoretical bound and lower 
time complexity, and that it can avoid maintaining the whole data matrix in main memory. 
Finally, experiments on several real-world datasets demonstrate significant improvement 
over the existing relative-error algorithms. 

Keywords: Large-scale matrix computations, low-rank matrix approximation, CUR 
matrix decomposition, randomized algorithms 
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1. Introduction 
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Large-scale matrices emerging from stocks, genomes, web documents, web images and videos 
everyday bring new challenges in modern data analysis. Most efforts have been focused on 
manipulating, understanding and interpreting large-scale data matrices. In many cases, 
matrix factorization methods are employed to construct compressed and informative rep- 
resentations to facilitate computation and interpretation. A principled approach is the 
truncated singular value decomposition (SVD) which finds the best low-rank approxima- 
tion of a data matrix. Applications of SVD such as eigenface (Sirovich and Kirby, 1987, 
Turk and Pentland, 1991) and latent semantic analysis (Deerwester et al., 1990) have been 
illustrated to be very successful. 

However, the basis vectors resulting from SVD have little concrete meaning, which 
makes it very difficult for us to understand and interpret the data in question. An example 
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in (Drineas et al., 2008, Mahoney and Drineas, 2009) has well shown this viewpoint; that is, 
the vector [(l/2)age - (l/\/2)height + (1/2) income] , the sum of the significant uncorrelated 
features from a dataset of people's features, is not particularly informative. Kuruvilla 
et al. (2002) have also claimed: "it would be interesting to try to find basis vectors for all 
experiment vectors, using actual experiment vectors and not artificial bases that offer little 
insight." Therefore, it is of great interest to represent a data matrix in terms of a small 
number of actual columns and/or actual rows of the matrix. 

The CUR matrix decomposition provides such techniques, and it has been shown to be 
very useful in high dimensional data analysis (Mahoney and Drineas, 2009). Given a matrix 
A, the CUR technique selects a subset of columns of A to construct a matrix C and a subset 
of rows of A to construct a matrix R, and computes a matrix U such that A = CUR best 
approximates A. The typical CUR algorithms (Drineas, 2003, Drineas et al., 2006, 2008) 
work in a two-stage manner. Stage 1 is a standard column selection procedure, and Stage 2 
does row selection from A and C simultaneously. Thus, implementing Stage 2 is much more 
difficult than doing Stage 1. 

The CUR matrix decomposition problem is widely studied in the literature (Goreinov 
et al., 1997a,b, Tyrtyshnikov, 2000, Drineas, 2003, Drineas and Mahoney, 2005, Drineas 
et al., 2006, 2008, Mahoney and Drineas, 2009, Mackey et al., 2011, Hopcroft and Kannan, 
2012). Among the existing work, several recent work are of particular interest. Drineas 
et al. (2006) proposed a CUR algorithm with additive-error bound. Later on, Drineas et al. 
(2008) devised randomized CUR algorithms with relative error by sampling sufficiently 
many columns and rows. Particularly, the algorithm has (1 + e) relative-error ratio with 
high probability (w.h.p.). Recently, Mackey et al. (2011) established a divide-and-conquer 
method which solves the CUR problem in parallel. 

Unfortunately, all the existing CUR algorithms require a large number of columns and 
rows to be chosen. For example, for an m x n matrix A and a target rank k < min{m, n}, 
the state-of-the-art CUR algorithm — the subspace sampling algorithm in Drineas et al. 
(2008) — requires exactly C(/c 4 e -6 ) rows or 0(ke~ 4 log 2 k) rows in expectation to achieve 
(1 + e) relative-error ratio w.h.p. Moreover, the computational cost of this algorithm is at 
least the cost of the truncated SVD of A, that is, 0(min{mn 2 , nm 2 }). 1 The algorithms are 
therefore impractical for large-scale matrices. 

In this paper we develop a CUR algorithm which beats the state-of-the-art algorithm 
in both theory and experiments. In particular, we show in Theorem 9 a novel randomized 
CUR algorithm with lower time complexity and tighter theoretical bound in comparison 
with the state-of-the-art CUR algorithm in Drineas et al. (2008). 

The rest of this paper is organized as follows. Section 2 lists some notations that will be 
used in this paper and Section 3 reviews two classes of CUR algorithms. Section 4 mainly 
introduces a column selection algorithm to which our work is closely related. Section 5 de- 
scribes and analyzes our novel CUR algorithm. Section 6 empirically compares our proposed 
algorithm with the state-of-the-art algorithm. All proofs are deferred to Appendix B. 



1. Although some partial SVD algorithms, such as Krylov subspace methods, require only 0(mnk) time, 
they are all numerical unstable. See Halko et al. (2011) for more discussions. 
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2. Notations 

For a matrix A = [ajj] G R mxra , let aW be its z-th row and a, be its j-th column. Let 
||A||i = Ylij \ a ij\ bethel-norm, ||A|| F = ■ a^) 1 / 2 be the Frobenius norm, and ||A||2 = 
maxi| x i| 2= i || Ax||2 be the spectral norm. Moreover, let I m denote the mxm identity matrix, 
and denotes the zero matrix whose size dependents on the context. Let p = rank(A) and 
k < p, the SVD of A can be written as 

p 

o-A,iUA,iV Ai = U A S A V A = U A ,fc5]A,fcV AiA . + U A ,fciS Ai/:1 V Ajjti , 

i=0 

where U Ai t, ^A,k, an d V A ,fc correspond to the top k singular values. We denote A& = 
U Aj fcS Aj fcV A k . Furthermore, let A^ = Ua./jE^ 1 V A p be the Moore-Penrose inverse of A 
(Ben-Israel and Greville, 2003). 

Given matrices A G R mxn , X G R mxp , and Y G R qxn , XXtA = U X U X A G R mxn is 
the projection of A onto the column space of X, and AY^Y = AVyVy G R mxn is the 
projection of A onto the row space of Y. Finally, given an integer k < p, we define the 
matrix IIx,fe(A) G W nxn as the best approximation to A within the column space of X 
that has rank at most k. We have IIx,fe(A) = XZ where Z = argmin rank ( Z ) <fc || A — XZ||i?. 
We also have that ||A - XXtA|| F < ||A - n x ,fc(A)|| F . 

3. Previous Work in CUR Matrix Decomposition 

This section discusses two recent developments of the CUR algorithms. Section 3.1 intro- 
duces an additive-error CUR algorithm in Drineas et al. (2006), and Section 3.2 describes 
two relative-error CUR algorithms in Drineas et al. (2008). 

3.1 The Linear-Time CUR Algorithm 

The linear-time CUR algorithm is proposed by Drineas et al. (2006). It is a highly efficient 
algorithm. Given a matrix A and a constant k < rank(A), by sampling c = 64£:e -4 
columns and r = 4ke~ 2 rows of A and computing an intersection matrix U, the resulting 
CUR decomposition satisfies the following additive-error bound 

E||A - CUR|| F < ||A - A fc || F + e||A|| F . 

Furthermore, the decomposition also satisfies rank(CUR) < k. Here we give its main 
results (Theorem 4 of Drineas et al., 2006) in the following proposition. 

Proposition 1 (The Linear-Time CUR Algorithm) Given a matrix A G R mxn , we 
let pi = ||aW||2/||A|| F and qj = \\aj W2/W A|| F . The linear-time CUR algorithm randomly 
samples c columns of A with probabilities {(?j}" = i and r rows of A with probabilities {pi}^i- 
Then 

E||A - CUR|| F < || A - A fe || F + ((4fc/c) 1 / 4 + (fc/r) 1 / 2 ) ||A|| F . 

The algorithm costs 0(mc 2 + nr + c 2 r + c 3 ) time, which is linear in (m + n) by assuming c 
and r are constants. 
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3.2 The Subspace Sampling CUR Algorithm 

Drineas et al. (2008) proposed a two-stage randomized CUR algorithm which has a relative- 
error bound w.h.p. In the first stage the algorithm samples c columns of A to construct C, 
and in the second stage it samples r rows from A and C simultaneously to construct R and 
Tjt. In the first stage the sampling probabilities are proportional to the squared £ 2 - n orm 
of the rows of Va,^, in the second stage the sampling probabilities are proportional to the 
squared £2- norm of the rows of Uc k- That is why it is called the "subspace sampling 
algorithm". Here we show the main results of the subspace sampling algorithms in the 
following proposition. 

Proposition 2 (The Subspace Sampling CUR Algorithm) Given a matrix A £ W nx 
and an integer k <C min{m, n} ; the subspace sampling algorithm uses exactly sampling 
to select exactly c = 0(k 2 e~ 2 log(l/<5)) columns of A to construct C, and then exactly 
r = 0(c 2 e~ 2 log(l/<5)) rows of A to construct R ; or uses expected sampling to select 
c = 0(ke~ 2 log k log (1/5)) columns andr = 0(ce~ 2 log clog(l/<5)) rows in expectation. Then 
with probability at least (1 — S), 

\\A - CUR|| F < (1 + e)||A - A k \\ F . 

Here, the matrix U is a weighted Moore-Penrose inverse of the intersection between C and 
R. The running time of both algorithms is dominated by the truncated SVD of A. 

Although the algorithm is e-optimal with high probability, it requires too many rows 
get chosen: at least r = 0(ke~ A log 2 k) rows in expectation. In this paper we seek to devise 
an algorithm with mild requirement on column and row numbers. 

4. Theoretical Backgrounds 

Section 4.1 considers the connections between the column selection problem and the CUR 
matrix decomposition problem. Section 4.2 introduces a near-optimal relative-error column 
selection algorithm. Our proposed CUR algorithm is motivated by and partly based on the 
near-optimal column selection algorithm. 

4.1 Connections between Column Selection and CUR Matrix Decomposition 

Column selection is a well-established problem which has been widely studied in the lit- 
erature: (Frieze et al., 2004, Deshpande et al., 2006, Drineas et al., 2008, Deshpande and 
Rademacher, 2010, Boutsidis et al., 2011b, Guruswami and Sinop, 2012). 

Given a matrix A G W nxn , column selection aims to choose c columns of A to construct 
L, G K so that ||A - CCtA|| f achieves the minimum. Since there are (") possible 
choices of constructing C, so selecting the best subset is a hard problem. In recent years, 
many polynomial-time approximate algorithms have been proposed, among which we are 
particularly interested in those algorithms with relative-error bounds; that is, with c > k 
columns selected from A, there is a constant rj such that 

HA-CC+AHf < r]\\A-A k \\ F . 
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We call r\ the relative- error ratio. For some randomized algorithms, the inequality holds 
either w.h.p. or in expectation w.r.t. C. 

The CUR matrix decomposition problem has a close connection with the column selec- 
tion problem. As aforementioned, the first stage of existing CUR algorithms is simply a 
column selection procedure. However, the second stage is more complicated. If the second 
stage is naively solved by a column selection algorithm on A T , then the error ratio will 
trivially be 2rj. 

For a relative-error CUR algorithm, the first stage seeks to bound a construction error 

ratio of ^A-A fc ^ F > wnne * ne section stage seeks to bound ^| A ^^(^tA|[^ F §i ven C. Actu- 
ally, the first stage is a special case of the second stage where C = Given a matrix A, 
if an algorithm solving the second stage results in a bound ^|a~^^^j^ £ — ^ then this 
algorithm also solves the column selection problem for A with an n relative-error ratio. 
Thus the second stage of CUR is a generalization of the column selection problem. 



4.2 The Near-Optimal Column Selection Algorithm 

Recently, Boutsidis et al. (2011a) proposed a randomized algorithm which selects only 
c = 2/ce~ 1 (l + o(l)) columns to achieve the expected relative-error ratio (1 + e). Boutsidis 
et al. (2011a) also proved the lower bound of the column selection problem; that is, at least 
c = fee -1 columns are selected to achieve the (1 + e) ratio. Thus this algorithm is near 
optimal. Though an optimal algorithm recently proposed by Guruswami and Sinop (2012) 
achieves the the lower bound, the optimal algorithm is quite inefficient compared with the 
near-optimal algorithm. 

The near-optimal algorithm has three steps: the approximate SVD via random projec- 
tion (Halko et al., 2011), the dual set sparsification algorithm (Boutsidis et al., 2011a), and 
the adaptive sampling algorithm (Deshpande et al., 2006). Here we present the main results 
of this algorithm in Lemma 3. To better understand the algorithm, we also give the details 
of the three steps, respectively. 

Lemma 3 (Near-Optimal Column Selection Algorithm) Given a matrix A G M. mxn 
of rank p, a target rank k (2 < k < p), and < e < 1, there exists a randomized algorithm 
to select at most 

c=^(l + o(l)) 

columns of A to form a matrix C £ lR mxc such that 

E 2 ||A - CC+AHf < E||A - CC ] A\\ 2 F < (1 + e)||A - A fc ||f, ; 

where the expectations are taken w.r.t. C. Furthermore, the matrix C can be obtained in 
0{(mnk + nk 3 )e- 2 ^). 

The dual set sparsification algorithm requires the top k right singular vectors of A as 
inputs. Since SVD is time consuming, Boutsidis et al. (2011a) employed an approximation 
SVD algorithm (Halko et al., 2011) to speedup computation. We give the theoretical analysis 
of the approximation SVD via random projection in Lemma 4. The resulting matrix Z 
approximates Va^. 
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Lemma 4 (Randomized SVD via Random Projection) Given a matrix A G W nxn 
of rank p, a target rank k (k < p), and < eo < 1, the algorithm computes a factorization 
A = BZ T + E with B = AZ, Z T Z = l k , and EZ = such that 

E||E||f,< (l + e )||A-A fc |||,. 

The algorithm runs in 0{mnkeQ l ) time. 

The second step of the near-optimal column selection algorithm is the dual set sparsifi- 
cation proposed by Boutsidis et al. (2011a). When ones take A and the top k (approximate) 
right singular vectors of A as inputs, the dual set sparsification algorithm can determinis- 
tically selects c\ columns of A to construct Ci. We present their results in Lemma 5 and 
attach the concrete algorithm in Appendix A. 

Lemma 5 (Column Selection via Dual Set Sparsification Algorithm) Given a ma- 
trix A G M. mxn of rank p and a target rank k (< p), the dual set spectral- Frobenius sparsifica- 
tion algorithm deterministically selects c\ (> k) columns of A to form a matrix Ci G R mxc i 
such that 



A-n Clife (A) 



< 4 1 + 



1 



(i - y/kfc? 



A — Ai 



Moreover, the matrix C\ can be computed in Ty A k + 0{mn + nc\k 2 ), where Ty Ak is the 
time needed to compute the top k right singular vectors of A. 



After sampling c\ columns of A, the near-optimal column selection algorithm uses the 
adaptive sampling of Deshpande et al. (2006) to select C2 columns of A to further reduce 
the construction error. We present Theorem 2.1 in Deshpande et al. (2006) in the following 
lemma. 



Lemma 6 (The Adaptive Sampling Algorithm) Given a matrix A G W nxn , we let 

Ci G R mxc i consists of c\ columns of A, and define the residual B = A — CiC{A. Addi- 
tionally, for i = 1, ■ ■ ■ , n, we define 

Pi = llbilll/IIBUl,. 

We further sample C2 columns i.i.d. from A, in each trial of which the i-th column is chosen 
with probability pi. Let C2 G M mxc2 contain the C2 sampled rows and let C = [Ci,C2] G 
jmx(ci+c 2 ) _ Then, for any integer k > 0, the following inequality holds: 

E\\A - C&A\\ 2 F < \\A - A k f F + — ||A - CiC{A|||., 

where the expectation is taken w.r.t. C2. 
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Algorithm 1 The Fast CUR Algorithm. 



1: Input: a real matrix A £ R mx ™, target rank k, e £ (0, 1], target column number c = ^ (l+o(l)), 

target row number r = -^(l + o(l)); 

2: // Stage 1: select c columns of A to construct C £ R mxc 

3: Compute approximate truncated SVD via random projection such that Afe w UfeSfeV^; 

4: Construct U\ <— columns of (A — UfeSfeVfe); Vi columns of ; 

5: Compute si <— Dual Set Spectral-Frobenius Sparsification Algorithm (Ui, Vi, c— 2fc/e); 

6: Construct Ci <— ADiag(si), and then delete the all-zero columns; 

7: Residual matrix D <- A - CiCjA; 

8: Compute sampling probabilities: pi = ||di|||/||Dj||,, i = 1,- • • , n; 

9: Sampling ci — 2k/e columns from A with probability {p\, ■ ■ ■ ,p n } to construct C2; 

10: // Stage 2: select r rows of A to construct Re M rx " 

11: Construct U2 <— columns of (A — UfeSfeVfe) T ; V2 columns of U^; 

12: Compute S2 <— Dual Set Spectral-Frobenius Sparsification Algorithm (U2, V2, r — 2c/e); 

13: Construct Ri -s— Diag(s 2 )A, and then delete the all-zero rows; 

14: Residual matrix B <— A AR|Ri; 

15: Compute sampling probabilities: = ||b(^ |||/||B|||,, j = 1, • • • , m; 

16: Sampling r 2 = 2c/e rows from A with probability {gi, • • • , q m } to construct R2; 

17: return C = [Ci, C 2 ], R = [Rf, R^] 7 , and U = C t AR t . 



5. Main Results 

In this section we develop a novel CUR algorithm that we call the fast CUR algorithm 
due to its lower time complexity in comparison with SVD. We describe the procedure in 
Algorithm 1 and give theoretical analysis in Theorem 9. 

The main results of our work are formally shown in three theorems in this section. The 
proofs are deferred to Appendix B. Theorem 9 relies on Lemma 3 and Theorem 8, and 
Theorem 8 relies on Theorem 7. Theorem 7 is a generalization of Lemma 6, and Theorem 8 
is a generalization of Lemma 3. 

5.1 Adaptive Sampling 

The relative-error adaptive sampling algorithm is established in Theorem 2.1 of Deshpande 
et al. (2006). The algorithm is based on the following idea: after selecting a proportion of 
columns from A to form Ci by an arbitrary algorithm, the algorithms randomly samples 
additional C2 columns according to the residual A — CiCjA. Boutsidis et al. (2011a) 
used the adaptive sampling algorithm to decrease the residual of the dual set sparsification 
algorithm and obtained an (1 + e) relative-error ratio. Here we prove a new bound for the 
same adaptive sampling algorithm. Interestingly, this new bound is a generalization of the 
original one in Theorem 2.1 of Deshpande et al. (2006). In other words, Theorem 2.1 of 
Deshpande et al. (2006) is a direct corollary of our following theorem when C = A&. 

Theorem 7 (The Adaptive Sampling Algorithm) Given a matrix A G M. mxn anal a 
matrix C £ R mxc such that rank(C) = rank(CC t A) = p (p < c < n), we let R4 £ W lXn 
consist of r\ rows of A, and define the residual B = A — AR{Ri. Additionally, for 
i = 1, • • • , m, we define 

Pi = ||bW||2/|| B |||,. 
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We further sample r2 rows i.i.d. from A, in each trial of which the i-th row is chosen 
with probability pi. Let R2 G W" 2Xn contain the r2 sampled rows and let R = [R^, R^] 7 G 
j^(ri+r 2 )xri_ Then the following inequality holds: 

E||A - CCtARtRl^ < ||A - C&A\\ 2 F + — ||A - AR{Ri|||, 

^2 

where the expectation is taken w.r.t. R2. 



5.2 The Fast CUR Algorithm 

Based on the randomized SVD algorithm of Lemma 4, the dual set sparsification algorithm 
of Lemma 5, and the adaptive sampling algorithm of Theorem 7, we develop a randomized 
algorithm to solve the second stage of the CUR problem. We present the results of the 
algorithm in the following theorem. 

Theorem 8 (The Fast Row Selection Algorithm) Given a matrix A G ]R mxn and a 
matrix C G R mxc such that rank(C) = rank(CCtA) = p (p < c < n), and a target rank k 
(< p), the proposed randomized algorithm selects r = ^(1 + o(l)) rows of A to construct 
Ret rxn , such that 

E||A - CC+AR+RI^ < ||A - C&A\\ 2 F + e\\A - A k \\ 2 F , 

where the expectation is taken w.r.t. R. Furthermore, the matrix R can be computed in 
0((mnk + m/c 3 )e -2 / 3 ) time. 

Note that Lemma 3, i.e., Theorem 5 of Boutsidis et al. (2011a), is a special case of 
Theorem 8 when C = A&. Based on Lemma 3 and Theorem 8, we have the main theorem 
for the fast CUR algorithm as follows. 

Theorem 9 (The Fast CUR Algorithm) Given a matrix A G W nxn and a positive 
integer k <C min{m,n} ; the fast CUR algorithm described in Algorithm 1 randomly selects 
c = ^(l+o(l)) columns of A to construct C G R mxc with the near-optimal column selection 
algorithm of Lemma 3, and then selects r = ^p(l + o(l)) rows of A to construct R G R rxn 
with the fast row selection algorithm of Theorem 8. Then we have 

E||A - CUR|| F = E||A - C(C t AR t )R|| F < (1 + e)||A - A fc || F . 

Moreover, the algorithm runs in time O [ranker 2 ^ + (m + n)fc 3 e~ 2 / 3 + mk 2 e~ 2 + nk 2 e~^j . 

Since k,c,r <C min{m, n} by the assumption, so the time complexity of the fast CUR 
algorithm is lower than that of the SVD of A. This is the main reason why we call it the 
fast CUR algorithm. 

Another advantage of this algorithm is that it can avoid loading the whole m x n data 
matrix A into main memory. None of three steps — the randomized SVD, the dual set 
sparsification algorithm, and the adaptive sampling — requires loading the whole of A into 
memory. The most memory-expensive operation throughout the fast CUR Algorithm is 
computing the Moore-Penrose inverses of C and R, which requires maintaining an m x c 
matrix or an r x n matrix in memory. In contrast, the subspace sampling algorithm requires 
loading the whole matrix into memory to compute its truncated SVD. 
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Table 1: A summary of the datasets. 
Dataset Type size Source 

Rcdrock natural imagcl8000 x 4000 http://www.agarwala.org/efficient_gdc/ 
Edinburgh natural imagel6500 x 1800 http://www.agarwala.org/efficient_gdc/ 
Arcene biology 10000 x 900 http://archive.ics.uci.edu/ml/datasets/Arcene 
Dexter bag of words 20000 x 2600 http : //archive . ics . uci . edu/ml/datasets/Dexter 
PicasaWeb features 50000 x 3000https : //sites . google . com/site/picasawebdataset 



6. Empirical Analysis 

In this section we conduct empirical comparisons among the relative-error CUR algorithms 
on several datasets. We report the relative-error ratio and the running time of each algo- 
rithm on each data set. The relative-error ratio is defined by 

||A-CUR|| F 

Relative-error ratio = — n-_ _ — n , 

||A- A fc || F 

where A; is a specified target rank. 



6.1 Datasets 

We implement experiments on five datasets, including natural images, biology data, and 
bags of words. Table 1 briefly summarizes some information of the datasets. The Redrock 
and Edinburgh (Agarwala, 2007) are two large size natural images. Arcene and Dexter 
are both from the UCI datasets (Frank and Asuncion, 2010). Arcene is a biology dataset 
with 900 instances and 10000 attributes. Dexter is a bag of words dataset with a 20000- 
vocabulary and 2600 documents. PicasaWeb image dataset (Wang et al., 2012) contains 6.8 
million PicasaWeb images. We use the HMAX features (Serre et al., 2007) and the SIFT 
features (Lowe, 1999) of the first 50000 images; the features provided by Wang et al. (2012) 
are all of 3000 dimensions. Each dataset is actually represented as a data matrix, upon 
which we apply the CUR algorithms. 

When the data matrices become very large, e.g., say 8K x 3K, the truncated SVD and 
the standard SVD are both infeasible in our experiment environment, and so is the subspace 
sampling algorithm. Therefore we do not conduct experiments on larger data matrices. In 
contrast, our fast CUR algorithm actually works well even for 30 K x 3K matrices. 



6.2 Setup 

We implement the subspace sampling algorithm and our fast CUR algorithm in MATLAB 
7.10.0. We do not compare with the linear-time CUR algorithm for the following reason. 
There is an implicit projection operation in the linear-time CUR algorithm, so the result 
satisfies rank(CUR) < k. However, this inequality does not hold for the subspace sampling 
algorithm and the fast CUR algorithm. Thus, comparing the construction error among 
the three CUR algorithm is very unfair for the linear-time CUR algorithm. Actually, the 
construction error of the linear-time CUR algorithm is much worse than the other two 
algorithms. 
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Figure 1: Empirical results on the Redrock data set. 



We conduct experiments on a workstation with 12 Intel Xeon 3.47GHz CPUs, 12GB 
memory, and Ubuntu 10.04 system. According to the analysis in Drineas et al. (2008) and 
this paper, k, c, and r should be integers much less than m and n. For each data set and 
each algorithm, we set k = 10, 20, or 50, and c = ak, r = ac, where a ranges in each set 
of experiments. We repeat each set of experiments for 20 times and report the average and 
the standard deviation of the error ratios. The results are depicted in Figures 1, 2, 3, 4, 5, 
and 6. 



6.3 Result Analysis 



The results show that the fast CUR algorithm has much lower relative-error ratio than the 
subspace sampling algorithm. The experimental results well match our theoretical analyses 
in Section 5. As for the running time, the fast CUR algorithm is more efficient when c and 
r are small. When c and r become large, the fast CUR algorithm becomes less efficient. 
This is because the time complexity of the fast CUR algorithm is linear in e -4 and large c 
and r imply small e. However, the purpose of CUR is to select a small number of columns 
and rows from the data matrix, that is, c < n and r m. Thus we are not interested in 
the cases where c and r are large compared with m and n, e.g., say k = 20 and a = 10. 
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Figure 2: Empirical results on the Edinburgh data set. 
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Figure 3: Empirical results on the Arcene data set. 



11 



Running Time 


Subspace Sampling (Exactly) 

Subspace Sampling (Expected] 

Fast CUR 




i 




/ 

/ 

X 







2 4 6 8 10 12 14 16 18 20 22 24 26 28 



1.25 
1.2 

o 

cc 

6 1.1 

t 

tu 

0)1.05 
> 

- 1 
£ 

0.95 
0.9 



Construction Error (Frobenius Norm) 

Subspace Sampling (Exactly) 

Subspace Sampling (Expected) 
--Fast CUR 




"■Mi. 



26 28 

(a) fe = 10, c = afc, and r — ac 



2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 

a 



Running Time 


Subspace Sampling (Exactly) 

Subspace Sampling (Expected) 

Fast CUR 








r 











Running Time 



6 8 10 12 14 16 18 20 22 24 



Construction Error (Frobenius Norm) 



Subspace Sampling (Exactly) 

Subspace Sampling (Expected) 
-Fast CUR 




Subspace Sampling (Exactly) 

Subspace Sampling (Expected) 
Fast CUR 




/ 







2 4 6 



10 12 14 16 



Construction Error (Frobenius Norm) 



1.2 
1.15 

B 1-1 
C 1.05 

I 1 
UJ 

0.95 
> 

S 0.9 

II 0.85 



Subspace Sampling (Exactly) 

Subspace Sampling (Expected) 
-■Fast CUR 




(b) k = 20, c = ak, and r — ac. (c) k = 50, c = ak, and r = ac. 



Figure 4: Empirical results on the Dexter data set. 
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Figure 5: Empirical results on the HMAX features of the PicasaWeb image data set. 
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Figure 6: Empirical results on the SIFT features of the PicasaWeb image data set. 



7. Discussions 

In this paper we have proposed a novel randomized algorithm for the CUR matrix de- 
composition problem. This algorithm is faster, more scalable, and more accurate than the 
state-of-the-art algorithm, i.e., the subspace sampling algorithm. Our algorithm requires 
only c = 2/ce _1 (l+o(l)) columns and r = 2ce -1 (l-|-o(i)) rows to achieve (1+e) relative-error 
ratio. To achieve the same relative-error bound, the subspace sampling algorithm requires 
c = 0(ke~ 2 log k) columns and r = C(ce~ 2 logc) rows selected from the original matrix. 
Our algorithm also beats the subspace sampling algorithms in time-complexity. Our algo- 
rithm costs 0(mnke~ 2 / 3 + (m + n)/c 3 e~ 2 / 3 + mk 2 e~ 2 + nk 2 e~ 4: ) time, which is lower than 
C(min{ mn 2 , m 2 n}) of the subspace sampling algorithms when k is small. Moreover, our 
algorithm enjoys another advantage of avoiding loading the whole data matrix into main 
memory, which also makes our algorithm more scalable. Finally, the empirical comparisons 
have also demonstrated the effectiveness and efficiency of our algorithm. 

However, there are several open questions involving the lower bound of the CUR matrix 
decomposition problem. First, what is the lower bound for the CUR problem? Second, is 
there any algorithm achieving such a lower bound? Boutsidis et al. (2011b) proved a lower 

||A— CC^AH 2 h 

bound for the column selection problem: ^-n-r — ,J F > 1 + - . We thus wonder if there is a 

|[A— Afc||p c 

similar lower bound on the ratio ^^^1^ , e.g., say (1 + ranl ^ c ) ), We shall address 
these questions in future work. 



Acknowledgments 



13 



Algorithm 2 Deterministic Dual Set Spectral-Frobenius Sparsification Algorithm. 

1: Input: U = {x«}? = i C R l , (I < n); V = {v,}" = i C K fe , with £" =1 v * v f = I k (k < n); k < r < n; 
2: 

3: 



Initialize: so = 0, Ao = 0; 

Compute llxjllo for i = 1, • • • ,n, and then compute Sjj — ^- li=1 | x '^ 2 ; 

1 — yfc/r 

for r = to r — 1 do 

Compute the eigenvalue decomposition of A r ; 

Find an index j in {1, ■ ■ ■ ,n} and compute a weight t > such that 

vj(A T - (L T + l)I fe ) 2 Vj i 

« s r s ^ + i.A T )-^:,A r ) -^( A '- (L ' +i)it ) v - 

where 

k _ 1 
0(L,A)=^(Ai(A)-L) , L r = r-\/^; 

i=l 

Update the j-th component of s r and A T : s T+ i[j] = s T [j] + t, A r+ i = A T + tvjvj; 
end for 

l—Jkp 

return s = — s r . 



This work has been supported in part by the Natural Science Foundations of China (No. 
61070239) and the Google visiting faculty program. 

Appendix A. The Dual Set Sparsification Algorithm 

For the sake of completeness, we attach the dual set sparsification algorithm here and de- 
scribe some implementation details. The dual set sparsification algorithms are deterministic 
algorithms established in Boutsidis et al. (2011a). The fast CUR algorithm calls the dual 
set spectral- Frobenius sparsification algorithm (Lemma 13 in Boutsidis et al., 2011a) in both 
stages. We show this algorithm in Algorithm 2 and its bounds in Lemma 10. 

Lemma 10 (Dual Set Spectral-Frobenius Sparsification) Let U = {xi, -- ,x n } C 
R l (I < n) contain the columns of an arbitrary matrix X 6 R' xn . Let V = {vi, • • • , v n } C M. k 
(k < n) be a decompositions of the identity, i.e., Y17=i w i w J = ^-k- Given an integer r with 
k < r < n, Algorithm 2 deterministically computes a set of weights > (i = 1, • • ■ , n) at 
most r of which are non-zero, such that 

Afc(^SjVivf) > (l - d -) and tr( ^ SjXjxf) < ||X||^. 

i=l r i=l 

The weights Sj can be computed deterministically in 0(rnk 2 + nl) time. 

Here we would like to mention the implementation of Algorithm 2, which is not described 
by Boutsidis et al. (2011a) in details. In each iteration the algorithm performs once eigen- 
value decomposition: A T = WAW . Here A T is guaranteed to be positive semi-definite in 
each iteration. Since 

(A T - al k y = WDiag((Ai - a) q , ■ ■ ■ , (A fc - a) q ) W T , 
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we can efficiently compute (A T — (L T + l)Ifc) 9 based on the eigenvalue decomposition of A r . 
With the eigenvalues at hand, (p(L,A T ) can also be computed directly. 

The algorithm runs in r iterations. In each iteration, the eigenvalue decomposition 
of A r requires 0(k 3 ), and the n comparisons in Line 6 each requires 0(k 2 ). Moreover, 
computing \\xiW2 for each Xj requires 0(nl). Overall, the running time of Algorithm 2 is at 
most 0(rk 3 ) + 0(rnk 2 ) + 0(nl) = 0{rnk 2 + nl). 

Appendix B. Proofs 

B.l The Proof of Theorem 7 

Theorem 7 can be equivalently expressed in Theorem 11. In order to stick to the column 
space convention of Boutsidis et al. (2011a), we prove Theorem 11 instead of Theorem 7. 

Theorem 11 (Adaptive Sampling Algorithm) Given a matrix A G M. mxn and a ma- 
trix R G W xn such that rank(R) = rank(AR t R) = p (p < r <m), let Ci G R mxci consist 
of c\ columns of A, and define the residual B = A — CiC\A. For i = 1, ■ ■ ■ , n, let 

Pi = llbilll/IIBUl., 

where bj is the i-th column of the matrix B. Sample further C2 columns from A in C2 i.i.d. 
trials, where in each trial the i-th column is chosen with probability pi. Let C2 G M mxc2 
contain the C2 sampled columns and C = [Ci, C2] G M. mx( - Cl+C2 ^ contain the columns of both 
Ci and C2, all of which are columns of A. Then the following inequality holds: 

E||A - CC+AR+RI^ < ||A - AR+R|||. + — ||A - CiC{A|||.. 

C2 

where the expectation is taken w.r.t. C2. 

Proof With a little abuse of symbols, we use bold uppercase letters to denote matrix ran- 
dom variables and bold lowercase to denote vector random variables, without distinguishing 
between matrix/vector random variables and constant matrices/vectors. 

We denote the j-ih column of V AR tR p & W ixp as Vj, and the (i, j)-th entry of V AR t R „ 
as Vij. Define vector random variables x-j^i) G M m such that for j = 1, ■ ■ ■ , n and I = 

x i,(0 = ~ D * = ~ ( a i — CiC}ai) with probability pi, for i = 1, ■ • ■ , n, 

Pi Pi ^ ' 

Note that Xj^i) is a linear function of a column of A sampled from the above defined 
distribution. We have that 

n 

e [ x j,(q] = E^Tr bj : 
•=1 Pl 

i=l ^ 



Bv,, 



|B|||. 



15 



Then we let Xj = ^ Y11=i x i,(0' we nave 



Bv 



Ellx,- - Bv 



J 112 



E 



E[x, 



Ie 

C2 



x i,(0 - e [ x j,(/)] 



— Ellx 

C2 



: i,(0 



Bv 



J 1 1 2 - 



According to the construction of Xi,--- , x p , we define the C2 columns of A to be C2 € 
E mxc2 . Note that all the random variables Xi • • • ,x p lie in the subspace span(Ci) + 
span(C2). We define random variables 

Wj = CiC| AR^Rvj + Xj = CiCjAvj + xj, for j = 1, • • • , 

where the second equality follows from Lemma 12 that ARtRy, = Avj if Vj is one of 
the top p right singular vectors of ARtR. Then we have that any set of random variables 
{wi, • • • ,Wp} lies in span(C) = span(Ci) + span(C2). Let W = [wi, 
random variable, we have that span(W) C span(C). The expectation of Wj is 

E[wj] = CiC} A Vj + E[xj] = CiC}Avj + B Vj = Ay,-, 
therefore we have that 



, w p ] be a matrix 



Wj - A Vj 



Xi 



The expectation of ||wj — AvjHl is 



Ellw,- - Av 



3W2 



Ellx,- - Bv 



j\\2 



Bv, 



—E||x 

C2 



Bv 



j\\2 



^E||x j)(J) ||l - ^-(B Vj -) T E[x j)(0 ] + ^IIBvjlH 



C2 
1 

= —Ex 



C2 

< -||B| 

C2 



: i,(0ll2 



C2 



C2 



C2 



|Bv 



3W2 



IBv 



C2 



C2 



J'll2 



To complete the proof, we let the matrix variable 



(1) 



F = E^w.u^ARtR, 

q=l 

where a q is the g-th largest singular value of ARtR and u,j is the corresponding left singular 
vector of AR^R. The column space of F is contained in span(W) C span(C), and thus 

||AR f R - CC+AR+RH^ < ||AR f R - WW^ARtRHl < || AR f R - F\\ 2 F . 

We use F to bound the error || ARtR - CCtARtR||f, : 



EIIA-CC+AR+RH 



E|| A - ARtR + AR f R - CC f AR+RH] 



E 



I A - ARtRlli + || ARtR - CCtARtR|| 



(2) 



< || A - ARtRll p + Ell ARtR — F| 



Fi 
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where (2) follows from that A (I - RtR) is orthogonal to (I - CCt)ARtR. Since ARtR 
and F both lies on the space spanned by the right singular vectors of ARtR, i.e. {vj}j =1 , 
we decompose ARtR — F along {vj}j =1 : 



E||A- CCtARtRHl < 



< 



A - AR f R||^ + E|| AR f R - F||f,, 



A - AR f R||^ + J^E (AR+R - F)v 3 

3=1 

P P 

A - ARtR|||, + ^E ARtRvj - a^WqU^ajUj 

3=1 9=1 
P 

A - AR f R||^ + AR'Rvj - Wj 

3=1 
P 

A - AR f R||^ + ^2E\\Avj - Wj 

3=1 

A - AR^R|||, + — IIBIIp, 
c 2 



|2 
v 3\\2 



(3) 
(4) 



where (3) follows from Lemma 12 and (4) follows from (1). 



Lemma 12 We are given a matrix A G W nxn and a matrix R G W xn such that rank(ARtR) = 
rank(R) = p (p < r <m). Letting Vj G W 1 be the j-th top right singular vector of ARtR, 
we have that 

AR f R Vi = Avj, for j = 1, ■ ■ ■ , p. 

Proof First let Vr iP G R nx f contain the top p right singular vectors of R, then the 
projection of A onto the row space of R is ARtR = AV^ p V| p . Let the thin SVD of 
AV RiP G R mx P be USV T , where V G W x p. Then the compact SVD of ARtR is 

ARtR = AVr^V^ = USV r V^. 

According to the definition, Vj is the j-th. column of (Vr jP V) G W ixp , and thus Vj lies on 
the column space of Vr iP , and Vj is orthogonal to Vr iP j_. Finally, since A — ARtR = 
AVr )P iV^ x , we have that Vj is orthogonal to A — ARtR, that is, (A — ARtR)vj = 0, 
which directly proves the lemma. ■ 



B.2 The Proof of Theorem 8 

Boutsidis et al. (2011a) proposed a randomized algorithm which achieves the expected 
relative-error bound in Lemma 13. This algorithm is described in Line 3 to 6 of Algorithm 1. 
Lemma 13 is a direct corollary of Lemma 4 and Lemma 5. If we apply the same algorithm 
to A to select c rows of A to form Ri, that is, Line 11 to 13 of Algorithm 1, then a very 
similar bound is guaranteed. 
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Lemma 13 (Boutsidis et al. (2011a), Theorem 4) Given a matrix A G W nxn of rank 
p, a target rank 2 < k < p, and < eo < 1, there is a randomized algorithm to select c\> k 
columns of A and form a matrix Ci G IR mxCl such that 



E|| A - dCjAHl < (1 + e ) (l + )==., 



|A — A 



k\\F> 



where the expectation is taken w.r.t. C\. The matrix C\ can be computed in O^nke^ 1 + 
nc\k 2 ) time. 

With Theorem 7 and Lemma 13, we now prove Theorem 8 as follows. 
Proof This randomized algorithm has three steps: approximate SVD via randomized 
projection (Halko et al., 2011), deterministic column selection via dual set sparsification 
algorithm (Boutsidis et al., 2011a) shown in Lemma 5, and the adaptive sampling algorithm 
of Theorem 7 proved in this paper. This algorithm is a generalization of the near-optimal 
column selection algorithm of Lemma 3. 

Given A G jj mxri anc j a target ran k k < r±, step 1 (Line 3 of Algorithm 1) compute 
an approximate truncated SVD of A in O(mnk/eo) time such that A& « A& = U^SfeV^. 
Lemma 4 shows that 

E||A - UfclVVf ||| < (1 + eo)||A - A k f F . 

Step 2 (Line 11 to 13 of Algorithm 1) selects r\ rows of A to construct Ri by the dual 
set sparsification algorithm taking U and V as input, where U contains all the m columns 
of (A T — AjT) G M. nxm , V contains all the m columns of k G R kxm . Lemma 13 shows 
that 



E||A- AR}Ri|||. < (l + e )(l + 



|A- A 



fcllF) 



where the expectation is taken w.r.t. Ri. Step 2 costs 0(rnr\k 2 + ran) time. 

Step 3 (Line 14 to 16 of Algorithm 1) samples additional r2 rows of A to construct R2 G 



R r2X " by the adaptive sampling algorithm of Theorem 7. Let R 
We apply Theorem 7 and have that 



[R^ , R' 



TXT 



E R ||A-CCtARtR||| = E Rl 



< E Rl 



^R 2 
IA- 



| A - CC+ARtRHl, 



Ri 



CC+AN^ + —IIA - ARlR 



T2 



1 



< ||A-CC t A|| F + -^(l + e )fl + ; ) 



By setting n = 0(ke 2 / 3 ), r 2 



— , and eo = e 2 / 3 , we conclude that 



|A- A k f F . 



Ell A - CC^AR+RHl < || A - C^AIli + ell A 



AfcUl-. 



The total computation time of the three steps is O(mnk/eo + mr\k 2 + mn) = 0{{mnk + 
mfc 3 )e" 2 / 3 ) ■ 
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B.3 The Proof of Theorem 9 



Proof Since C is constructed by columns of A, the column space of C is contained in 
the column space of A, so rank(CC T A) = rank(C) = p < c, and thus the assumptions of 
Theorem 8 are satisfied. Lemma 3 and Theorem 8 together prove Theorem 9: 



E 2 ||A 



CURI 



< E||A-CUR||| 



e c> r||a- ccTarTrh! 



E c E R ||A - CCtARtRH 
E c |"||A-CC t A||!. + e||A 



< 

< (1 + 26)|| A 



AjfcHiT 



A fc || 2 fe . 



Finally we have E||A - CUR|| F < (1 + e)||A - A fe || fe because 1 + 2e < (1 + e) 2 . 

The time cost of the fast CUR algorithm is the sum of Stage 1, Stage 2, and the Moore- 
Penrose inverse of C and R, i.e. 0((mnk + nfc 3 )e~ 2 / 3 ) + 0((mnk + m/c 3 )e~ 2 / 3 ) + 0(mc 2 ) + 
0(nr 2 ) = 0(mnke- 2/3 + (m + n)A; 3 e" 2/3 + mk 2 e~ 2 + nk 2 e~ A ). ■ 
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